options(scipen=999)

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
setwd("~/Desktop/ppaf24/homework/hw5")

#LA bikeshare q3 data (july through september)
metrodat <- read.csv("metro-trips-2022-q3.csv")

Los Angeles is a large city with a wealth of culture and activities for residents and visitors alike. Between Metro’s public transportation options and Bike Share system, our City provides multiple affordable options for anyone traveling through. Metro Bike in particular is a growing part of the system, with 225 stations and counting throughout the city, mostly in downtown and coastal hotspots. We anticipate that ridership will continue to grow every year, but in order to encourage use of the system we must take steps to make the system as easy to use and reliable as possible. One critical consideration is the re-balancing of bikes across the network. Riders may not want to use the system if they are not confident that they will have somewhere to dock their bike at the end of their trip.

The pleasant climate with little precipitation and proximity to beaches make Los Angeles a popular summer destination. This report takes a closer look at use of the system during the summer months of July through September in order to predict demand for certain stations, taking into special consideration the two major holidays during these months: Independence Day and Labor Day. Understanding demand in this popular destination, for residents commuting using Metro Bike and visitors exploring the city, will inform the recommendations for rebalancing and ensuring a dependable system for all users.

metrodat2 <- metrodat %>%
  filter(passholder_type != "Testing") %>%
  mutate(interval60 = floor_date(mdy_hm(start_time), unit = "hour"),
         interval15 = floor_date(mdy_hm(start_time), unit = "15 mins"),
         week = week(interval60), month=month(mdy_hm(start_time)),
         dotw = wday(interval60, label=TRUE))
laxCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2022, 
          state = "CA", 
          geometry = TRUE, 
          county=c("Los Angeles"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

laxTracts <- 
  laxCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

metro_census <- st_join(metrodat2 %>% 
          filter(is.na(start_lat) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(start_lon) == FALSE &
                   is.na(end_lon) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        laxTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., laxTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)
#using dtla station since LAX is close to the ocean
weather.Panel <- 
  riem_measures(station = "CQT", date_start = "2022/07/01", date_end = "2022/09/30") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

City and System Characteristics

The following charts give an overview of the system’s performance.

The weather panel from the Downtown LA weather station shows consistently warm temperatures. There are only a few days in September where precipitation occurs and temperatures drop. For this reason, this analysis will not consider precipitation as a main predictor for demand, though temperature and wind speed will be considered as they differ by time of day.

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Downtown Los Angeles - July-September 2022")

The next few charts look at trip characteristics.

ggplot(metro_census %>%
         group_by(month,interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. Los Angeles, 2022",
       x="Date", 
       y="Number of trips")+
  facet_wrap(~ month, scales = "free", nrow=3)+
  plotTheme

When looking at ridership by month, we can see that August and September had more riders overall, with one day in August having a large spike among an otherwise steady flow of trips.

metro_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Los Angeles, July-September 2022",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

Most stations have a low number of hourly trips no matter the time of day with a few exceptions, like a few stations with more trips during rush hour.

grid.arrange(nrow=2,
ggplot(metro_census %>% mutate(hour = hour(interval60)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Los Angeles, by day of the week, July-September 2022",
       x="Hour", 
       y="Trip Counts",
       color=NULL)+
     plotTheme,


ggplot(metro_census %>% 
         mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  scale_color_manual(values=c("#92BCAB","#123F5A"))+
  labs(title="Bike share trips in Los Angeles - weekend vs weekday, July-September 2022",
       x="Hour", 
       y="Trip Counts",
       color=NULL)+
     plotTheme)

Day-by-day, there is not a great amount of variation in travel trends by hour except for weekends have a longer range of hours where people take trips during the day, with weekdays seeing their peak in the evening hours. Looking strictly at differences between weekday/weekend, more trips are made on weekdays than the weekend. Again, the weekend has more hours with a high number of trips, while weekdays see a peak in later hours.

Ridership Maps

The following maps explore time of day/time of week patterns spatially.

ggplot()+
  geom_sf(data = laxTracts %>%
          st_transform(crs=4326), fill="grey95", color="grey80")+
  geom_point(data = metro_census %>% 
            mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 0.5, size = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma", name=NULL)+
  ylim(min(metro_census$from_latitude), max(metro_census$from_latitude))+
  xlim(min(metro_census$from_longitude), max(metro_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Los Angeles, July-September 2022")+
  mapTheme

From this map, we see that the parts of the network with the highest ridership are consistently along the coast and downtown. For both weekdays and weekends, ridership is highest mid-day (10am-3pm) and during the PM rush (3pm-6pm), though some stations see high ridership overnight.

LAmetrostations <- st_read("https://public.gis.lacounty.gov/public/rest/services/LACounty_Dynamic/LMS_Data_Public/MapServer/185/query?outFields=*&where=1%3D1&f=geojson")

metroBuffers <- 
  st_union(st_buffer(LAmetrostations, 805)) %>%
  st_sf() %>%
  st_make_valid()
ggplot()+
  geom_sf(data = laxTracts %>%
          st_transform(crs=4326), fill="grey95", color="grey95")+
  geom_point(data = metro_census %>% 
            mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 0.5, size = 0.6)+
  geom_sf(data=metroBuffers, fill=NA, color="firebrick")+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma", name=NULL)+
  ylim(min(metro_census$from_latitude), max(metro_census$from_latitude))+
  xlim(min(metro_census$from_longitude), max(metro_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Los Angeles, July-September 2022",
       subtitle="Red circles represent half mile buffer from LA Metro Rail Stations")+
  mapTheme

As Metro Bike is part of Metro’s multimodal system alongside bus and rail lines, another important consideration is proximity to transit. A half mile buffer for Metro’s rail lines is created to assess whether the stations with the highest demand are located along the transit network. Are users transferring to/from Metro lines?

Though most of the stations with the highest ridership along the coast are not located along the Metro network, the high ridership area of downtown is. This suggests that transit use should be a consideration when making predictions.

metro_census <- metro_census %>%
  mutate(plan_duration = case_when(plan_duration==1 ~ "Day",
                                   plan_duration==30 ~ "Month",
                                   plan_duration==365 ~ "Year"))

ggplot()+
  geom_sf(data = laxTracts %>%
          st_transform(crs=4326), fill="grey95", color="grey80")+
  geom_point(data = metro_census %>% 
            mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, from_latitude, from_longitude, weekend, time_of_day, plan_duration) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = plan_duration, size=n), 
            fill = "transparent", alpha = 0.7)+
  scale_color_manual(values=c("goldenrod2","royalblue3","firebrick"), name="Plan Duration")+
  scale_size("Number of Trips")+
  ylim(min(metro_census$from_latitude), max(metro_census$from_latitude))+
  xlim(min(metro_census$from_longitude), max(metro_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Los Angeles, July-September 2022")+
  guides(color = guide_legend(override.aes = list(size=5)))+
  mapTheme

The final map looks at the duration of passes for trips. This gives us a sense of whether users are consistently using Metro as their commute mode or if the trips made are spontaneous or leisure rides. The map shows that many trips originating from downtown and West LA are users with monthly or annual passes for Metro Bike. Stations along the coast have more users who purchased day passes.

Predicting Metro Bike Demand

study.panel <- 
  expand.grid(interval60=unique(metro_census$interval60), 
              start_station = unique(metro_census$start_station)) %>%
  left_join(., metro_census %>%
              select(start_station, Origin.Tract, from_longitude, from_latitude,plan_duration)%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))


#[1] 483075
#[1] 483075
ride.panel <- 
  metro_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, from_longitude, from_latitude,plan_duration) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)
ride.panel <- 
  left_join(ride.panel, laxCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))
ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 185 | yday(interval60) == 248,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))

The first step in predictions is creating a time/space panel for analysis. This panel is engineered to include fixed effects like demographics based on the Census tract that the station is located in and hour/day/holiday effects. The panel also includes time lags to more accurately assess demand in the hours around a given time interval, putting the trip into context rather than treating it as an isolated event and assuming that a trip is independent of the time or place that it occurs (this would simply be untrue). There is also a lag effect created for each of the holidays, with effects of 3 days before and after the holiday to see if there are differences in network use. In 2022 both of these holidays fell on a Monday, making a 3-day weekend.

This analysis is particularly focused on ridership in the summer months and during holidays. Our models are trained on the last two weeks of August and the first two weeks of September, for a total 4 week training set including the Labor Day holiday. The test set uses three weeks in July, including Independence Day, to make predictions.

ride.Train <- filter(ride.panel, week >= 34 & week <=37)
ride.Test <- filter(ride.panel, week >= 27 & week <= 29)

Five linear models are created to predict the number of trips for each station, each testing a different combination of variables.

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature, data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Wind_Speed, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + plan_duration, 
     data=ride.Train)
  • Model 1 predicts using the hour the trip was made, the day of the week, and the temperature.
  • Model 2 predicts using the origin station, day of the week, and temperature.
  • Model 3 predicts using the origin station, the hour the trip was made, day of the week, temperature, and wind speed.
  • Model 4 predicts using the origin station, the hour the trip was made, day of the week, temperature, and the time lag variables.
  • Model 5 predicts using the origin station, the hour the trip was made, day of the week, temperature, time lag variables, holiday, holiday lag, and plan duration.
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 
model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}
week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

Looking at each model side-by-side, we can see that models 4 and 5 have the lowest Mean Absolute Error. This suggests that the time lags enhanced the predictive power of the spatial variables like the origin station.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      scale_color_manual(values=c("#92BCAB","#123F5A"))+
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Los Angeles; 3 week test set",  x = NULL, y= "Station Trips") +
      plotTheme

The time series of each regression confirm that models 4 and 5 make the most accurate predictions. Though they are very similar, model 5 performs slightly better around the holiday. Model 5, which includes spatial effects, time lags, and holiday fixed effect, is the optimal model with the most predictive power. However, it does miss the peaks of high demand seen in the observed outcomes.

The time series and MAE plots do not tell the full story, though. As we saw in the above plots and maps, ridership varies by station, type of user, and where their trips originate. Looking at our predictions in space will offer a more complete picture of where to improve our model and where to focus the rebalancing plan.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, start_station, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
  group_by(start_station, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = laxCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 0.6)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(metro_census$from_latitude), max(metro_census$from_latitude))+
  xlim(min(metro_census$from_longitude), max(metro_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme

When looking at the mean absolute error in space, the highest errors are where we saw the most ridership in the above maps: downtown and near the coast. This suggests that there is something missing in our model to account for the users whose trips originate in this area.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    #geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

The differences between observed trips and predicted trips by time of day offers insight into the nature of the errors in the model. The predicted trips only go as high as 6 trips for each time of day, but some times of days have up to 15 trips. Though many predictions are close to accurate, the model fails to capture the higher ridership numbers at times of day other than the AM rush, under-predicting many of the trips.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = laxCensus, fill="grey95", color="grey80")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 0.6, alpha = 0.5)+
  scale_colour_viridis(direction = -1, 
  discrete = FALSE, option = "plasma")+
  ylim(min(metro_census$from_latitude), max(metro_census$from_latitude))+
  xlim(min(metro_census$from_longitude), max(metro_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

Looking at the errors spatially and by time of day, the trend continues of the highest errors being in downtown and coastal stations. The errors are most pronounced during mid-day and the PM rush.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_White = map(data, pull, Percent_White),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Med_Age = map(data, pull, Med_Age)) %>%
    select(interval60, start_station, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_White, Med_Inc, Med_Age,Percent_Taking_Public_Trans) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "Mid-Day") %>%
  group_by(start_station, Percent_White, Med_Inc, Med_Age,Percent_Taking_Public_Trans) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

Some errors could be explained by demographics of the areas around a station, since these variables were not considered when making predictions. The above chart plots the mean absolute error of trips as a function of various socioeconomic variables: median age, median income, percentage of public transit users, and percent white population. These demographics are at the tract level from ACS 2022 estimates.

Our model shows some stations that the model did not work well with for the different demographics. Stations located in tracts with a median age of 35-45 have higher errors. Stations located in areas with higher incomes ($75,000 or higher) also had higher errors. The percentage of transit users was relatively low across stations, but the stations where people in the area use transit the least had higher errors. Finally, stations located in majority-white areas had higher error.

Rebalancing Plan

The optimal model performed well in terms of accuracy, but did not generalize well to high-demand areas, often under-predicting. This calls attention to ridership characteristics that are not captured within our data. A few possible reasons for the errors:

  • Users in high-demand areas like the coast and downtown may be visitors and not adhering to common time trends like rush hour
  • Weather near the beaches is different than in downtown, meaning that weather effects may differ between the two areas
  • The system is spatially disconnected and the two main clusters may not be entirely comparable to one another

Despite the errors, our model still performed accurately and our analysis identified areas to focus on for rebalancing. Given that the main weakness of the model was predicting high demand, a larger training set would be beneficial in increasing predictive power. Exploratory analyses and analyses of the errors identified the same high demand stations as those with larger errors. We are not capturing the demand with our current modeling approach, so our rebalancing should focus on these areas, incorporating our findings about the spatial and temporal variations in demand. Given the size of the city, the most effective rebalancing method would be transporting bikes to high-demand stations via truck to avoid unnecessary burden on users.

Our coastal stations serve many users with daily passes, with the most demand in the mid-day period. Many stations in Venice are close together and the popular Santa Monica station is nearby. A truck visiting just prior to the mid-day would be able rebalance as needed among the stations, ensuring an adequate number of bikes in the stations closest to the beach and along Venice Boulevard are available as the mid-day peak comes. This should be done again in the afternoon as we see a moderate number of trips in the overnight time period.

For downtown, the stations also close together and more abundant than the stations near the beaches. In the core of downtown, this gives users more places to dock reliably, and ridership does not reach high numbers like on the coast. High-demand stations peak in the PM rush, but ridership is more moderate in other time periods. For these reasons, twice-daily rebalancing during peak hours is not necessary, especially since users in this area tend to have monthly or yearly passes and are likely familiar with the system. However, there should absolutely be a daytime rebalancing prior to the PM rush, making sure that all stations have an adequate supply of bikes. This will make it easier for those beginning their destination during the peak time to get a bike at the most convenient location.

Though the algorithm isn’t completely generalizable and under-predicts high demand, we still identified important spatial and temporal trends in our network to inform our decision-making going forward. As ridership grows, a more advanced training set with more features to capture the differences in the stations will be crucial for a more precise rebalancing plan, but what we know about the trends for late summer Metro Bike use will help us to start being proactive in meeting demand.